Gulf of Mexico
Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
Hong, Feng, Yu, Geng, Ye, Yushi, Huang, Haicheng, Zheng, Huangjie, Zhang, Ya, Wang, Yanfeng, Yao, Jiangchao
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Asia > Singapore (0.04)
- (4 more...)
Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset
Reuss, Joana, Macdonald, Jan, Becker, Simon, Gikalo, Ekaterina, Schultka, Konrad, Richter, Lorenz, Körner, Marco
Accurate crop-type classification from satellite time series is essential for agricultural monitoring. While various machine learning algorithms have been developed to enhance performance on data-scarce tasks, their evaluation often lacks real-world scenarios. Consequently, their efficacy in challenging practical applications has not yet been profoundly assessed. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating supervised and SSL methods for crop-type classification under real-world conditions. This benchmark study relies on the EuroCropsML time-series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to supervised transfer learning and SSL methods. However, compared to simpler transfer learning, the improvement of meta-learning comes at the cost of increased computational demands and training time. Moreover, supervised methods benefit most when pre-trained and fine-tuned on geographically close regions. In addition, while SSL generally lags behind meta-learning, it demonstrates advantages over training from scratch, particularly in capturing fine-grained features essential for real-world crop-type classification, and also surpasses standard transfer learning. This highlights its practical value when labeled pre-training crop data is scarce. Our insights underscore the trade-offs between accuracy and computational demand in selecting supervised machine learning methods for real-world crop-type classification tasks and highlight the difficulties of knowledge transfer across diverse geographic regions. Furthermore, they demonstrate the practical value of SSL approaches when labeled pre-training crop data is scarce.
- Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.76)
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills
Meng, Yuan, Sun, Zhenguo, Fest, Max, Li, Xukun, Bing, Zhenshan, Knoll, Alois
Large language models (LLMs)-based code generation for robotic manipulation has recently shown promise by directly translating human instructions into executable code, but existing approaches are limited by language ambiguity, noisy outputs, and limited context windows, which makes long-horizon tasks hard to solve. While closed-loop feedback has been explored, approaches that rely solely on LLM guidance frequently fail in extremely long-horizon scenarios due to LLMs' limited reasoning capability in the robotic domain, where such issues are often simple for humans to identify. Moreover, corrected knowledge is often stored in improper formats, restricting generalization and causing catastrophic forgetting, which highlights the need for learning reusable and extendable skills. To address these issues, we propose a human-in-the-loop lifelong skill learning and code generation framework that encodes feedback into reusable skills and extends their functionality over time. An external memory with Retrieval-Augmented Generation and a hint mechanism supports dynamic reuse, enabling robust performance on long-horizon tasks. Experiments on Ravens, Franka Kitchen, and MetaWorld, as well as real-world settings, show that our framework achieves a 0.93 success rate (up to 27% higher than baselines) and a 42% efficiency improvement in feedback rounds. It can robustly solve extremely long-horizon tasks such as "build a house", which requires planning over 20 primitives. Code will be open-sourced upon acceptance. Y ou should try to preserve the previous functionality . Large language models (LLMs) and vision-language models (VLMs) have become integral to robotic manipulation due to their robust commonsense knowledge and advanced reasoning capabilities. Early approaches Co-Reyes et al. (2018); Lynch et al. (2023); Liu et al. (2023) relied on language embeddings conditioned within reinforcement learning or imitation learning to align robot actions with human commands. These methods often struggled with limited data efficiency and poor generalization. With the rapid progress of LLMs such as GPT, a natural direction has been to integrate them into the pipeline for task decomposition and language grounding Zhang et al. (2023); Huang et al. (2023); Guo et al. (2024). In this setting, an LLM decomposes a complex manipulation task into sub-tasks and invokes a pre-trained language-conditioned policy to execute low-level primitives. This approach assumes that the pre-trained policy can carry out each motion precisely, yet in practice, this is rarely possible due to environmental perturbations and imperfect policy design. Another direction for advancing human-level robotic manipulation is to adopt LLM or VLM backbones for large-scale pretraining on robotic data, creating end-to-end vision-language-action (VLA) foundation models Kim et al. (2024); Black et al. (2024); Bjorck et al. (2025).
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- North America > Montserrat (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
MELEGROS: Monolithic Elephant-inspired Gripper with Optical Sensors
Trunin, Petr, Cafiso, Diana, Nardin, Anderson Brazil, Exley, Trevor, Beccai, Lucia
The elephant trunk exemplifies a natural gripper where structure, actuation, and sensing are seamlessly integrated. Inspired by the distal morphology of the African elephant trunk, we present MELEGROS, a Monolithic ELEphant-inspired GRipper with Optical Sensors, emphasizing sensing as an intrinsic, co-fabricated capability. Unlike multi-material or tendon-based approaches, MELEGROS directly integrates six optical waveguide sensors and five pneumatic chambers into a pneumatically actuated lattice structure (12.5 mm cell size) using a single soft resin and one continuous 3D print. This eliminates mechanical mismatches between sensors, actuators, and body, reducing model uncertainty and enabling simulation-guided sensor design and placement. Only four iterations were required to achieve the final prototype, which features a continuous structure capable of elongation, compression, and bending while decoupling tactile and proprioceptive signals. MELEGROS (132 g) lifts more than twice its weight, performs bioinspired actions such as pinching, scooping, and reaching, and delicately grasps fragile items like grapes. The integrated optical sensors provide distinct responses to touch, bending, and chamber deformation, enabling multifunctional perception. MELEGROS demonstrates a new paradigm for soft robotics where fully embedded sensing and continuous structures inherently support versatile, bioinspired manipulation.
- North America > United States (0.04)
- South America > Brazil (0.04)
- Europe > Italy (0.04)
- (9 more...)
MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Liu, Xinyu, Sun, Guolei, Wang, Cheng, Yuan, Yixuan, Konukoglu, Ender
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. T o this end, we propose Med-VSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency.
- Asia > China > Hong Kong (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- (2 more...)
Graph Variate Neural Networks
Roy, Om, Moshfeghi, Yashar, Smith, Keith
Modelling dynamically evolving spatio-temporal signals is a prominent challenge in the Graph Neural Network (GNN) literature. Notably, GNNs assume an existing underlying graph structure. While this underlying structure may not always exist or is derived independently from the signal, a temporally evolving functional network can always be constructed from multi-channel data. Graph Variate Signal Analysis (GVSA) defines a unified framework consisting of a network tensor of instantaneous connectivity profiles against a stable support usually constructed from the signal itself. Building on GVSA and tools from graph signal processing, we introduce Graph-Variate Neural Networks (GVNNs): layers that convolve spatio-temporal signals with a signal-dependent connectivity tensor combining a stable long-term support with instantaneous, data-driven interactions. This design captures dynamic statistical interdependencies at each time step without ad hoc sliding windows and admits an efficient implementation with linear complexity in sequence length. Across forecasting benchmarks, GVNNs consistently outperform strong graph-based baselines and are competitive with widely used sequence models such as LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy highlighting their potential for brain-computer interface applications.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County (0.04)
- (2 more...)
LogicGuard: Improving Embodied LLM agents through Temporal Logic based Critics
Gokhale, Anand, Srivastava, Vaibhav, Bullo, Francesco
Large language models (LLMs) have shown promise in zero-shot and single step reasoning and decision making problems, but in long horizon sequential planning tasks, their errors compound, often leading to unreliable or inefficient behavior. We introduce LogicGuard, a modular actor-critic architecture in which an LLM actor is guided by a trajectory level LLM critic that communicates through Linear Temporal Logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. LogicGuard supports both fixed safety rules and adaptive, learned constraints, and is model-agnostic: any LLM-based planner can serve as the actor, with LogicGuard acting as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LogicGuard to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. To demonstrate generality, we evaluate LogicGuard across two distinct settings: short-horizon general tasks and long-horizon specialist tasks. On the Behavior benchmark of 100 household tasks, LogicGuard increases task completion rates by 25% over a baseline InnerMonologue planner. On the Minecraft diamond-mining task, which is long-horizon and requires multiple interdependent subgoals, LogicGuard improves both efficiency and safety compared to SayCan and InnerMonologue. These results show that enabling LLMs to supervise each other through temporal logic yields more reliable, efficient and safe decision-making for both embodied agents.
- North America > Mexico > Gulf of Mexico (0.28)
- North America > United States > Michigan (0.04)
- North America > United States > California (0.04)
- (3 more...)
- Law (0.93)
- Materials > Metals & Mining > Diamonds (0.66)
- Materials > Metals & Mining > Iron (0.48)
Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
Chen, Weiming, Zhu, Zhihan, Wang, Yijia, He, Zhihai
Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
- Europe > Switzerland (0.04)
- Asia > Singapore (0.04)
- North America > United States (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Lu, Qianqi, Xie, Yuxiang, Zhang, Jing, Zou, Shiwei, Chen, Yan, Luan, Xidao
Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland (0.04)
- Asia > China (0.04)
- (2 more...)
- Research Report (1.00)
- Overview (0.93)
Maestro: Self-Improving Text-to-Image Generation via Agent Orchestration
Wan, Xingchen, Zhou, Han, Sun, Ruoxi, Nakhost, Hootan, Jiang, Ke, Sinha, Rajarishi, Arık, Sercan Ö.
Text-to-image (T2I) models, while offering immense creative potential, are highly reliant on human intervention, posing significant usability challenges that often necessitate manual, iterative prompt engineering over often underspecified prompts. This paper introduces Maestro, a novel self-evolving image generation system that enables T2I models to autonomously self-improve generated images through iterative evolution of prompts, using only an initial prompt. Maestro incorporates two key innovations: 1) self-critique, where specialized multimodal LLM (MLLM) agents act as 'critics' to identify weaknesses in generated images, correct for under-specification, and provide interpretable edit signals, which are then integrated by a 'verifier' agent while preserving user intent; and 2) self-evolution, utilizing MLLM-as-a-judge for head-to-head comparisons between iteratively generated images, eschewing problematic images, and evolving creative prompt candidates that align with user intents. Extensive experiments on complex T2I tasks using black-box models demonstrate that Maestro significantly improves image quality over initial prompts and state-of-the-art automated methods, with effectiveness scaling with more advanced MLLM components. This work presents a robust, interpretable, and effective pathway towards self-improving T2I generation.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- Africa > Middle East > Egypt (0.04)